Please note: this project was originally a group project for UPenn's CIS 545: Big Data Analytics. The following notebook was my (Allison Kahn) contribution to the project. As such, this will not deal with any of the preprocessing steps for cleaning the data, nor exploring the raw data.

Hello!

This project will be centered around automatic music playlist continuation (APC), which is a form of the more general task of sequential recommendation. As the digital music industry continues to expand and transform, it is important to recognize its influence not only on the livelihoods of its artists, but on the actions and attitudes of the general public, and on culture writ large. Though automatic playlist continuation is but one small aspect of the digital music revolution, it promises to vastly improve audio streaming services’ ability to curate and deliver content recommendations at scale.

Given a Spotify playlist of arbitrary length, our objective will be to generate track recommendations that fit the target characteristics of the original playlist. For more details on the challenge description: https://research.atspotify.com/the-million-playlist-dataset-remastered/

In [ ]:
%%capture
import pandas as pd
from scipy import sparse
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
import numpy as np
import seaborn as sns
from pandas.api.types import CategoricalDtype
from sklearn.metrics.pairwise import cosine_similarity  
from sklearn.preprocessing import MinMaxScaler
!pip install pyspark
from pyspark.mllib.linalg.distributed import RowMatrix
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

Collaborative Filtering with kNN

We will begin with collaborative filtering. Collaborative filtering follows the intuition that if you find someone else who listens to similar songs as you, they might be able to give you good recommendations for songs you might like. Following this thought process, collaborative filtering aims to exclusively find similar playlists through common songs, without taking any information about the songs into account.

Preparing the Data

Our first step is to create a sparse matrix that has all of the artists in our dataset on one axis and all of the songs in our dataset on the other axis. If that song exists in the corresponding playlist, that value would equal one.

In [ ]:
df = pd.read_csv('imputed_dense20.csv')
In [ ]:
# split into train and test, stratifying along pid to ensure some songs from each playlist are in test
df_unique = df[['pid', 'track_uri']].drop_duplicates()
train_dense, test_dense = train_test_split(df_unique, test_size=0.25, random_state=42, stratify=df_unique['pid'])
test_dense = test_dense[test_dense['track_uri'].isin(train_dense['track_uri'])]

We're using a 75/25 split here because, as specified in the challenge description and the papers cited, we then need to remove any songs in the test set that don't appear in the training set. This leaves us with about a 20/80 split. Our train and test set is created by splitting each playlist and then predicting songs based off the ones in the training set. Once we have our predicted songs, we can see how successful we were.

In [ ]:
# create lookup table to connect pid and index value later on
pid_to_index = train_dense['pid'].drop_duplicates().sort_values().reset_index()
In [ ]:
# create a sparse matrix with entry for each combination of playlist and song

pid_cat = CategoricalDtype(sorted(train_dense.pid.unique()), ordered=True)
row = train_dense.pid.astype(pid_cat).cat.codes

track_cat = CategoricalDtype(sorted(train_dense.track_uri.unique()), ordered=True)
col = train_dense.track_uri.astype(track_cat).cat.codes

data = len(train_dense)*[1]

playlist_song_matrix_sparse = sparse.csr_matrix((data, (row, col)), shape=(pid_cat.categories.size, track_cat.categories.size))
playlist_song_matrix_sparse
Out[ ]:
<5071x99262 sparse matrix of type '<class 'numpy.longlong'>'
	with 288562 stored elements in Compressed Sparse Row format>

Training the Model

Our next step is to take this matrix and then run it through a k-Nearest Neighbors classification algorithm. After some experimentation, we decided on utilizing a euclidean distance metric. From there, for each target playlist we can find similar playlists and obtain any songs that aren't in the target playlist to use as predicted songs.

In [ ]:
#train k-Nearest Neighbors

knnModel = NearestNeighbors(metric='euclidean')
knnModel.fit(playlist_song_matrix_sparse)
Out[ ]:
NearestNeighbors(metric='euclidean')
In [ ]:
def getNumToPredict(pid, test_dense, k):
  num_in_test = len(test_dense[test_dense['pid'] == pid]['track_uri'])
  return num_in_test*k

def kpredict(model, pid, test_dense, k):
    #find number of songs to duplicate
    num_to_predict = getNumToPredict(pid, test_dense, k)

    #get index of songs in current playlist
    tracks_in_pid = list(train_dense[train_dense['pid'] == pid]['track_uri']) 
    
    #get row of values for playlist
    index_of_pid = pid_to_index[pid_to_index['pid'] == pid].index
    pid_values = playlist_song_matrix_sparse[index_of_pid].toarray()

    #find 100 most similar playlists 
    ind = model.kneighbors(pid_values, n_neighbors = 100, return_distance=False)
    recommened_playlists_ind = list(ind[0])
    output = []

    #for each song in similar playlist, add song if it is not already in playlist; break when hit desired # of songs
    for i in recommened_playlists_ind:
        current_pid = pid_to_index.iloc[i]['pid']
        potential_songs = list(train_dense[train_dense['pid'] == current_pid]['track_uri'])
        for song in potential_songs:
            if song not in tracks_in_pid:
                output.append(song)
            if len(output) == num_to_predict:
                break
        if len(output) == num_to_predict:
            break
    
    return output

Calculating Performance

After we generate our predicted songs, it's time to measure how we did. We are utilizing an r-precision metric as suggested in the Spotify Million Playlist challenge.

An r-precision score measures how many of the reserved songs (the “test set”) are in the list of predicted songs. For instance, if there are 10 songs that were reserved for the test set, and 8 of them are in our predicted songs list, we would have an r-precision score of 0.8

The mathematical representation of this is: $$ r-precision = \frac{|\begin{Bmatrix}recommended\:tracks \end{Bmatrix} \bigcap \begin{Bmatrix}heldout\: tracks\end{Bmatrix}| }{|\begin{Bmatrix}heldout\: tracks \end{Bmatrix}| } $$

This value is influenced by the number of predicted songs you create--the more songs you have, you're more likely to find the songs you're looking for. To mitigate this, throughout our models we are creating 15 predictions per song in the test set.

In [ ]:
def r_precision(prediction, test_set):
    score = np.sum(test_set.isin(prediction))/test_set.shape[0]
    return score
In [ ]:
r_precision_list = []
r_precision_dict = {}

listOfPIDs = list(train_dense['pid'].unique())
i = 0

for pid in listOfPIDs:
    pid = listOfPIDs[i]
    predicted_songs = kpredict(knnModel, pid, test_dense, 15) 
    withheld_songs = test_dense[test_dense['pid'] == pid]['track_uri']
    if len(withheld_songs) == 0:
      i+=1
      r_precision_dict[pid] = 0
      continue
    else:
      r_val = r_precision(predicted_songs, withheld_songs)
      r_precision_dict[pid] = r_val
      i+=1

    #error checking
    if isinstance(np.float64(r_val), np.floating) and np.isnan(r_val) == False:
      r_precision_list.append(r_val)

    if i % 500 == 0:
      print(round(i*100/len(listOfPIDs)), "%")

    #if i == 500:
     # break
    
   
    
10 %
20 %
30 %
39 %
49 %
59 %
69 %
79 %
89 %
In [ ]:
np.mean(r_precision_list)
Out[ ]:
0.14164912553509645

Exploration of Collaborative Filtering

Over our dataset, we end with an r-precision of 14.2%, a suprisingly successful result considering we are only looking at what other similar playlists are listening to, regardless of any data on the songs themselves.

An Example

From here, we can look a little at our results. First, we can examine an example playlist and see what recommended songs would be curated for the playlist.

In [ ]:
#sample playlist to continue
sample_pid = 1

#create list of predictions
predictions = kpredict(knnModel, sample_pid, test_dense, 12)
len(predictions)

#find withheld songs in playlist to test against
test_set = test_dense[test_dense['pid'] == sample_pid]['track_uri']

#find r-precision
r_precision(predictions, test_set)
Out[ ]:
0.375
In [ ]:
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud


word_could_dict=Counter(list(df[df['pid'] == sample_pid]['artist']))
wordcloud = WordCloud(width = 1000, height = 500).generate_from_frequencies(word_could_dict)

word_could_dict_pred = Counter(list(df[df['track_uri'].isin(predictions)]['artist']))
wordcloud_pred = WordCloud(width = 1000, height = 500).generate_from_frequencies(word_could_dict_pred)



plt.figure(figsize=(10,5))
plt.imshow(wordcloud)
plt.title("Artists in Playlist")
plt.show()

plt.figure(figsize=(10,5))
plt.imshow(wordcloud_pred)
plt.title("Artists in Predicted Songs")
plt.show()

This playlist seems to be predominatly classic rock, featuring bands like Rush, Boston and Led Zeppelin. While we see some of that in our recommended songs, with bands like Queen and Aerosmith, we also see a few rap artists like Big Sean and Future. From this, I would assume that there are potentially a large number of playlists that contain both classic rock and rap, a suprising combination. This would be a potentially interesting future step if we obtain data on the genre of each song or album. Overall, this playlist received an r-precision score of 0.375, a fairly successful example.

General Characteristics

One question I'm curious about is how different playlists result in different r-precisions--namely, do longer playlists fare better? What about more diverse playlists?

In [ ]:
# create dataframe of all r-precision results
precision_df = pd.DataFrame.from_dict(r_precision_dict, orient='index')
precision_df = precision_df.reset_index()
precision_df = precision_df.rename(columns={'index': 'pid', 0: 'R-Precision'})

# get length of data playlist
df_pid_len = df.groupby(by='pid').size()
df_pid_len = df_pid_len.to_frame()
df_pid_len = df_pid_len.reset_index()
df_pid_len = df_pid_len.rename(columns={0: 'length'})
In [ ]:
precision_length_df = precision_df.merge(right=df_pid_len, how='left', left_on='pid', right_on='pid')

g = sns.jointplot(x="length", y="R-Precision", data=precision_length_df,
                  kind="reg", truncate=False)
g.fig.suptitle("R-Precision vs Playlist Length")

g
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x7f35da026a50>

From this, we can see that longer playlists do tend to have higher r-precisions. We can also see that the y-axis is dominated by a high number of 0 values, likely because of the limited nature of our dataset. Statistically, we would expect most playlists to be dominated by popular songs with a few more esoteric playlists along the edges. With more playlists, I would expect fewer playlists out on the edges, and thus, fewer 0 values.

Next, we can look at how variety in a playlist influences r-precision:

In [ ]:
num_artists = df.groupby('pid')['artist'].nunique().to_frame()
num_songs = df.groupby('pid')['track_uri'].nunique().to_frame()

artist_song_index = num_artists.merge(num_songs, how='inner', on='pid')
artist_song_index['index'] = artist_song_index['track_uri']/artist_song_index['artist']

precision_artist_song_index_df = artist_song_index.merge(right=precision_df, how='inner', left_on='pid', right_on='pid')
precision_artist_song_index_df.head()
Out[ ]:
pid artist track_uri index R-Precision
0 0 37 51 1.378378 0.153846
1 1 21 39 1.857143 0.375000
2 2 31 64 2.064516 0.000000
3 3 86 126 1.465116 0.000000
4 5 56 75 1.339286 0.058824
In [ ]:
g = sns.jointplot(x="index", y="R-Precision", data=precision_artist_song_index_df,
                  kind="reg", truncate=False)
g.fig.suptitle("R-Precision vs Unique Artist Index")

g
Out[ ]:
<seaborn.axisgrid.JointGrid at 0x7f35d15cc990>

Suprisingly, when we use an index of the number of unique songs to the number of unique artists in a playlist, it does not seem to have a large influence on r-precision. My best guess for why this would be is that playlists don't necessarily need to have repeat artists in order to be identifiable; in other words, similar playlists are similar regardless of how often artists make repeat apperances.

Content Based Filtering with Cosine Similarity

Content Based Filterings is structured around the idea that if you can measure how similar two songs are, you can find the closest match for any songs. Thus, for all songs in our playlist, we can find songs that have similar features and would be good candidates to recommend.

Preparing the Data

In [ ]:
#split into train and test
train, test = train_test_split(df, test_size=0.25, random_state=42, stratify=df['pid'])
test = test[test['track_uri'].isin(train['track_uri'])]

print(len(train), len(test))
292257 75439

As above, we're using a 75/25 split here because, as specified in the challenge description and the papers cited, we then need to remove any songs in the test set that don't appear in the training set. This leaves us with a 20/80 split. Also as above, we are seperating each playlist into a training and test set so that we can use the test set to measure how well our predictions match.



In order to simplify the complexity of the task, we're going to be focusing on songs that appear in multiple playlists. We decided on cutting out songs that don't appear in more than 20 playlists to increase our chances of finding popular recommendations. It is important to note, however, that we have not removed these songs from our test set, just to keep things fair.

In [ ]:
#only want to use songs that have appeared in at least 20 playlists
train_popular_tracks = train.groupby(by='track_uri').size()
train_popular_tracks = train_popular_tracks.to_frame()
train_popular_tracks = train_popular_tracks.reset_index()
train_popular_tracks = train_popular_tracks.rename(columns={0: 'count'})
train_popular_songs_df = train[train['track_uri'].isin(train_popular_tracks[train_popular_tracks['count'] > 20]['track_uri'])]


# final train data cleaned with just the columns we will use-- distinct based on track data but not distinct by song
train_content = train_popular_songs_df[['song_popularity', 'release_date', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo']].drop_duplicates()
train_content = train_content.reset_index().reset_index()
In [ ]:
# Create tables to reconnect fields later on

# track info to track uri 
track_uri_to_content = train_popular_songs_df[['track_uri','song_popularity', 'release_date', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo']].drop_duplicates()

# connect index of training data to the song identifier since multiple songs can be represented by same track data
index_to_uri = train_content.merge(right=track_uri_to_content, how="right", on=['song_popularity', 'release_date', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo'])[['level_0','track_uri']]

train_content.shape
Out[ ]:
(2245, 15)

Here, we're scaling our data in order to keep all values on the same level. We are also starting up a pySpark instance to speed up the computation.

In [ ]:
# scale training data
scaler = MinMaxScaler()
scaler.fit(train_content)
train_content_scaled = scaler.transform(train_content)

# check still same size as before scaling
train_content_scaled.shape
Out[ ]:
(2245, 15)
In [ ]:
# set up pySpark session

spark = SparkSession.builder.appName('final_project').getOrCreate()
sparkContext=spark.sparkContext

Fill Out The Matrix

Our next step is to create the matrix to utilize the cosine distance between each song to calculate the similarity between each combination of songs. Because we have 2245 distinct songs, this results in C(2245,2) = 2,518,890 comparions.

Cosine difference is defined as: $$ cos(track_{1}, track_{2}) = \frac{track_{1} \cdot track_{2}}{\left \| track_{1} \right \| \cdot \left \| track_{2} \right \|} $$

and measures distance--since we're measuring similarity, we'll use $$similarity = 1-cos(track_{1}, track_{2})$$

In [ ]:
# create matrix and calculate column similarity 
rows = sparkContext.parallelize(train_content_scaled.T)
mat = RowMatrix(rows)

# Calculate exact similarities
exact = mat.columnSimilarities()

# check correct size
print(mat.numRows(),mat.numCols())
print(exact.numRows(),exact.numCols())
15 2245
2245 2245
In [ ]:
# convert to spark dataframe
sdf = exact.entries.toDF() #~4 minutes
sdf.show()
/usr/local/lib/python3.7/dist-packages/pyspark/sql/context.py:127: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  FutureWarning
+----+----+------------------+
|   i|   j|             value|
+----+----+------------------+
| 233|1265|0.6477859565023192|
| 220| 977|0.7942795417794847|
| 953|2112|0.9153798182511237|
| 883| 974|0.5107142023792293|
| 372|2024|0.7047977619537767|
| 379|1752|0.7658811272068164|
|1137|1530|0.7866206837207408|
| 918|1768|0.8755069510926992|
|1446|1755|0.9250128288268059|
| 918|2162|0.9305726059471695|
| 973|2118| 0.839199049273362|
|1487|1888|0.9542050399957487|
| 320| 788|0.8214136594375285|
| 750|1154|0.7872566135177153|
|1100|1656|0.5565475405159468|
|1818|1883|0.7753910890907202|
| 523| 843|0.9587696807506574|
| 593|1527|0.8909555125413827|
| 918|1737|0.8767936304135799|
| 464| 640|0.7390427716431779|
+----+----+------------------+
only showing top 20 rows

In [ ]:
sdf.persist()
Out[ ]:
DataFrame[i: bigint, j: bigint, value: double]

For each song, we want to find 10 candidates with the highest simiarities. Once we join these candidates back up with the original data, we can rank each option and choose the best recommendations per playlist.

In [ ]:
# for each song, find the 10 songs with the highest similarities

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
windowDept = Window.partitionBy("i").orderBy(col("value").desc())
sdf2 = sdf.withColumn("row",row_number().over(windowDept))

sdf2 = sdf.withColumn("row",row_number().over(windowDept))
max_10_j_sdf=sdf2.filter(col("row") <= 10)
In [ ]:
#find the corresponding uri to each predicted song
index_to_uri_pred_sdf = index_to_uri.rename(columns={'level_0':'pred_index', 'track_uri': 'predicted_track'})
index_to_uri_pred_sdf = spark.createDataFrame(index_to_uri_pred_sdf) 
sdf_with_predicted = max_10_j_sdf.join(index_to_uri_pred_sdf, max_10_j_sdf.j == index_to_uri_pred_sdf.pred_index, 'left')


#find the corresponding uri to each origin song
index_to_uri_from_df = index_to_uri.rename(columns={'level_0':'from_index', 'track_uri': 'from_track'})
index_to_uri_from_sdf = spark.createDataFrame(index_to_uri_from_df) 
final_sdf = sdf_with_predicted.join(index_to_uri_from_sdf, sdf_with_predicted.i == index_to_uri_from_sdf.from_index, 'left').select('from_track', 'predicted_track', 'value')
final_sdf.show()
+--------------------+--------------------+------------------+
|          from_track|     predicted_track|             value|
+--------------------+--------------------+------------------+
|spotify:track:5IM...|spotify:track:2gE...|0.9843507410289587|
|spotify:track:5IM...|spotify:track:7oO...|0.9643521616167579|
|spotify:track:5IM...|spotify:track:4EE...| 0.963268532513836|
|spotify:track:5IM...|spotify:track:4o6...|0.9614421293994869|
|spotify:track:5IM...|spotify:track:4wC...|0.9599156133730117|
|spotify:track:5IM...|spotify:track:7s2...|0.9585288021175584|
|spotify:track:5IM...|spotify:track:6zs...|0.9544836347820035|
|spotify:track:5IM...|spotify:track:6Fe...|0.9537015870542584|
|spotify:track:5IM...|spotify:track:2La...|0.9521133558732151|
|spotify:track:5IM...|spotify:track:3Lm...|0.9487958922056672|
|spotify:track:52Q...|spotify:track:5Fi...|0.9845068055056349|
|spotify:track:52Q...|spotify:track:2mX...|0.9702152445612933|
|spotify:track:52Q...|spotify:track:4XT...|0.9677628270344563|
|spotify:track:52Q...|spotify:track:6Tl...|0.9670193073272657|
|spotify:track:52Q...|spotify:track:7KO...|0.9654370036778122|
|spotify:track:52Q...|spotify:track:55O...|0.9654081452865069|
|spotify:track:52Q...|spotify:track:5tz...| 0.964002175546157|
|spotify:track:52Q...|spotify:track:0ES...|0.9631376121089872|
|spotify:track:52Q...|spotify:track:7eq...|0.9629954115519099|
|spotify:track:52Q...|spotify:track:072...|0.9628862707148248|
+--------------------+--------------------+------------------+
only showing top 20 rows

In [ ]:
# for each song in each playlist, find its closest match to use as the "predicted" song and convert to pandas df
track_songs_df = train[['pid', 'track_uri']]
track_songs_sdf = spark.createDataFrame(track_songs_df) 

pred_tracks = track_songs_sdf.join(final_sdf, final_sdf.from_track == track_songs_sdf.track_uri, 'left').select('pid', 'predicted_track', 'value')
print((pred_tracks.count(), len(pred_tracks.columns)))

pred_tracks_df = pred_tracks.toPandas()
pred_tracks_df.head()
(1115695, 3)
Out[ ]:
pid predicted_track value
0 68 None NaN
1 3374 spotify:track:4Sfa7hdVkqlM8UW5LsSY3F 0.974017
2 3374 spotify:track:6E9V9TRlVOLjenGjHemzEH 0.980654
3 3374 spotify:track:0p1HtkrNYxv0iDfEKwXSTp 0.986127
4 3374 spotify:track:37F0uwRSrdzkBiuj0D5UHI 0.982394

Calculate Performance

As discussed above, we will be using r-precision to calculate our success. We will also be using a factor of 15 to choose how many songs to recommend, ensuring that our r-precision scores are comparable.

In [ ]:
# for each playlist, get the specified number of recommended tracks, prioritizing those with the highest match score
def getPredicted(pred_tracks_df, pid, num_to_predict):
  return pred_tracks_df[pred_tracks_df['pid'] == pid].nlargest(num_to_predict, 'value')['predicted_track']
In [ ]:
# calculate r precision between predicted songs and withheld songs

r_precision_list = []
r_precision_dict = {}
listOfPIDs = list(pred_tracks_df['pid'].unique())

for i in range(len(listOfPIDs)):
    pid = listOfPIDs[i]
    withheld_songs = test[test['pid'] == pid]['track_uri']
    predicted_songs = getPredicted(pred_tracks_df, pid, len(withheld_songs)*15)
    if len(withheld_songs) == 0:
      continue
    
    r_val = r_precision(predicted_songs, withheld_songs)

    r_precision_dict[pid] = r_val

    #error checking
    if isinstance(np.float64(r_val), np.floating) and np.isnan(r_val) == False:
      r_precision_list.append(r_val)

    if i % 500 == 0:
      print(round(i*100/len(listOfPIDs)), "%")

    
0 %
10 %
20 %
30 %
39 %
49 %
59 %
69 %
79 %
89 %
99 %
In [ ]:
# average r precision as final score
np.mean(r_precision_list)
Out[ ]:
0.058688212064975986

We end with an r-precision of 5.7%, a much lower score than calculated in the collaborative filtering method. From here, we can explore the results.

Exploration of Content Filtering

An Example

First, we can look at our most closely matched duos and see what type of songs are paired together.

In [ ]:
final_sdf.createOrReplaceTempView('final_sdf')

query = 'select * from final_sdf order by value desc limit 500'
#query = 'select count(*) from final_sdf where from_track is null'
example = spark.sql(query).toPandas()
example.head()
Out[ ]:
from_track predicted_track value
0 spotify:track:53Dj5PCDhb22qWqmre3YQs spotify:track:2rg3yLJKN5Yl4JCHHkMgeC 0.998578
1 spotify:track:0ct6r3EGTcMLPtrXHDvVjc spotify:track:1mMLMZYXkMueg65jRRWG1l 0.997407
2 spotify:track:1zWZvrk13cL8Sl3VLeG57F spotify:track:6zQyu8L8yUuJkl6LbQ6iKU 0.997182
3 spotify:track:0iA1unTbTbDOWUSlbwJ1pS spotify:track:3I7krC8kr0gFR7P6vInR1I 0.997159
4 spotify:track:5NLuC70kZQv8q34QyQa1DP spotify:track:05RgAMGypEvqhNs5hPCbMS 0.996827
In [ ]:
example_details_top_5 = example[:5].merge(df, how='inner', left_on='from_track', right_on='track_uri').merge(df, how='inner', left_on='predicted_track', right_on='track_uri')

example_details_top_5_clean = example_details_top_5[['track_x', 'artist_x', 'track_y', 'artist_y']].drop_duplicates().rename(columns={'track_x': 'from_track', 'artist_x': 'from_artist', 'track_y': 'pred_track', 'artist_y': 'pred_artist'})
example_details_top_5_clean.head()
Out[ ]:
from_track from_artist pred_track pred_artist
0 Aw Naw Chris Young Time Is Love Josh Turner
930 The Nights Avicii When It Rains It Pours Luke Combs
2119 T-Shirt Thomas Rhett Cardiac Arrest Bad Suns
4171 It Don't Hurt Like It Used To Billy Currington Do I Make You Wanna Billy Currington
6033 Fight For Your Right Beastie Boys Panama - 2015 Remastered Version Van Halen

Some of these make sense-- for instance, the first example connects two country songs and the last example connects the Beastie Boys with van Halen. There is even an example of a connection between two songs by the same artist.

Interestingly, it also connects another country song with an Avicii song. When I think of Avicii, I tend to think of his early EDM-focused work but upon listening to the song again, I did notice some country influences, and in fact The Observer notes that the artist "tapped into the market potential of mixing EDM and country, a template many artists have since recreated".

The only mismatch I can see is between country artist Thomas Rhett's 'T-Shirt' and Bad Sun's alternative rock song 'Cardiac Arrest'.

General Characteristics

First, we can look at what aspects of two songs are likely to indicate a high similarity

In [ ]:
example_details_large = example.merge(df, how='inner', left_on='from_track', right_on='track_uri').merge(df, how='inner', left_on='predicted_track', right_on='track_uri')

x_columns = ['release_date_x', 'track_duration_ms_x',
      'song_popularity_x', 'danceability_x', 'energy_x',
       'key_x', 'loudness_x', 'mode_x', 'speechiness_x', 'acousticness_x',
       'instrumentalness_x', 'liveness_x', 'valence_x', 'tempo_x']

y_columns = ['release_date_y','track_duration_ms_y', 'song_popularity_y',
       'danceability_y', 'energy_y', 'key_y', 'loudness_y', 'mode_y',
       'speechiness_y', 'acousticness_y', 'instrumentalness_y', 'liveness_y',
       'valence_y', 'tempo_y']

t = example_details_large[x_columns + y_columns].drop_duplicates().corr()[x_columns]
sns.heatmap(t[t.index.isin(y_columns)])
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f35d1412610>

It seems like, generally speaking, what we would expect. Among pairs of songs that are the most similar, the have more correlated with their corresponding value of the same feature. It looks like the most closely joined values are the key of the songs, the song's popularity and the valence of a song, defined by Spotify as "the musical positiveness conveyed by a track".

Among the least valued appear to be the track's duration and instrumentalness. This last one in particular suprised me as I would have thought that instrumental songs would be closely linked.



Next, I want to take a look at how often certain tracks are repeated among the 500 closest pairs

In [ ]:
ax = sns.histplot(data=example.groupby(by=['predicted_track']).size(), shrink=2)
ax.set_title('How Often Are The Same Songs Recommened?')
ax.set_xlabel('Occurance')
ax.set_ylabel('Count')
ax
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f35da02ec50>
In [ ]:
df[df['track_uri'].isin(list(example.groupby(by=['predicted_track']).size().nlargest(1).index))][['artist', 'track']].head(1)
Out[ ]:
artist track
49212 Kane Brown What Ifs

Amoung these 500 closest pairs, the majority are only recommended once--the one song that appears at the recommendation for 8 other songs? Country artist Kane Brown's 2017 single 'What Ifs'.



Finally, we can look at the distribution of similarity scores among all of these recommended songs

In [ ]:
large_cuttoff = example['value'].min()
pred_tracks_df['is_large'] = pred_tracks_df['value'].apply(lambda x: 1 if x>large_cuttoff else 0)

ax = sns.histplot(data=pred_tracks_df, x='value', hue='is_large')
ax.set_title('Distribution of Chosen Song Similarity')
ax.set_xlabel('Similarity')
ax.set_ylabel('Count')
ax
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f35c2eb33d0>
In [ ]:
pred_tracks_df[pred_tracks_df['is_large'] == 1].sort_values('value').head(1)
Out[ ]:
pid predicted_track value is_large
1030379 2649 spotify:track:5Q0Nhxo0l2bP3pNjpGJwV1 0.990537 1

The values are higher than I would have expected-- most are clustered around 0.97 with very few scoring below 0.9 and all of the similarity scores for our 500 closest matches being above 0.990537.

In Conclusion

One potential avenue for enhanced track recommendations would be integration of additional content and context features like Wikipedia articles, track/album reviews, song lyrics, etc. In particular, text-based features could be parsed using any number of natural language processing (NLP) techniques, such as tf-idf or doc2vec cosine similarity. One noticeable deficiency in the Spotify API is the lack of track-specific genre information; incorporating this additional data may allow for the extrapolation of track genre, in addition to any number of other helpful features.

It may also be interesting to explore alternative metrics for evaluating the accuracies of track recommendations for playlist continuation. Options include empirical measures like normalized discounted cumulative gain (NDCG, a statistic that is commonly used to measure web search engine effectiveness), or more user-centric, albeit subjective, survey evaluations of recommendation quality.

If given more time, we might also pursue ensemble methods that combine content-based approaches with collaborative-based approaches. Together, both methods would ideally be able to compensate for the other's shortcomings and increase overall performance.

Finally, we would expand the scope of our project by considering all 1,000,000 playlists in official dataset and examine the effect on our success metric. All of these constitute just a few of the many possible extensions of our work. Overall, we found that a collaborative filtering approach yielded the highest R-score (14.2%), a somwehat surprising result given its focus on intra-playlist features rather than inter-playlist features.